Fix cns zap logger issue #4069
Conversation
Pull Request Overview
This PR fixes a CNS crash issue caused by aggressive stacktrace capture in zap logging during metrics recording. The crash occurs when zap attempts to capture stacktraces on Info-level logs in hot paths, leading to Windows access violations.
- Changed unbuffered channel to buffered channel (size 1) to prevent signal drops during metrics recording
- Added stacktrace capture only for WarnLevel and above to reduce aggressive stacktrace capture on Info logs
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| cns/restserver/metrics.go | Changed channel from unbuffered to buffered (size 1) to prevent signal drops |
| cns/logger/cnslogger.go | Added stacktrace capture configuration to limit it to WarnLevel and above |
rbtr left a comment
these changes seem superficial, so I'm confused about what the underlying issue is here. from the description it sounds like sometimes on Windows getting a stacktrace panics? but if we're getting a stacktrace I expect we're in a catastrophic failure already?
I don't think buffering a channel is going to help resolve that.
```go
zapLogger := zap.New(
	platformCore,
	zap.AddCaller(),
	zap.AddStacktrace(zapcore.WarnLevel),
```
why Warn? imo we don't need stacks until Error+
```diff
 recorder.once.Do(func() {
 	recorder.podIPConfigSrc = service.PodIPConfigStates
-	recorder.sig = make(chan struct{})
+	recorder.sig = make(chan struct{}, 1)
```
disagree with this change, unless this can't be called more than twice concurrently (and if that was the case, why twice instead of once and leave this unbuffered?). this smells and indicates some other structural problem.
all it does is delay when you will get a blocking send by one more event. maybe sig should only have non-blocking sends?
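The non-blocking-send pattern the reviewer alludes to can be sketched in isolation (names here are illustrative, not CNS's): with a buffer of 1, a `select` with a `default` branch either queues a signal or coalesces it with one already pending, and senders never block.

```go
package main

import "fmt"

// signal performs a non-blocking, coalescing send. One pending signal is
// enough to wake the receiver, so dropping extras loses nothing.
func signal(sig chan struct{}) bool {
	select {
	case sig <- struct{}{}:
		return true // signal queued
	default:
		return false // a signal is already pending; coalesced
	}
}

func main() {
	sig := make(chan struct{}, 1)

	fmt.Println(signal(sig)) // true: buffer empty, signal queued
	fmt.Println(signal(sig)) // false: coalesced with the pending signal

	<-sig                    // receiver drains the pending signal
	fmt.Println(signal(sig)) // true: buffer empty again
}
```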
Reason for Change:
Customers occasionally observed CNS crashes caused by CNS zap logging in RequestIPConfigsHandler(), which defers service.publishIPStateMetrics(), so every request ends by signaling the recorder:
publishIPStateMetrics() spins up go recorder.run() (once) and sends on recorder.sig.
run() calls record(), which logs the pool watermarks via logger.Printf("Allocated IPs: %d, ...", ...).
The crash stack shows the process dying inside zap while capturing a stack trace: zap's stacktrace capture fires on an Info log immediately after the handler returns, causing a Windows access violation. On Windows, stacktrace.Capture() can crash under load, so disabling it on Info-level logs removes the crash path.
CNS zap's stacktrace.Capture() uses runtime.Callers, and we have seen rare access violations in that path when stacktrace capture is aggressive. As a result, capturing stacktraces on hot Info paths can crash the process in this environment.
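To make the cost concrete, a stacktrace capture along the lines zap uses can be sketched with only the standard library: runtime.Callers fills a buffer of program counters, then runtime.CallersFrames resolves them to symbols. This work runs on every log line once stacktrace capture is enabled at that level (this is a simplified illustration, not zap's actual implementation).

```go
package main

import (
	"fmt"
	"runtime"
)

// captureStack gathers the calling goroutine's stack as function names.
func captureStack() []string {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(2, pcs) // skip runtime.Callers and captureStack itself
	frames := runtime.CallersFrames(pcs[:n])
	var out []string
	for {
		frame, more := frames.Next()
		out = append(out, frame.Function)
		if !more {
			break
		}
	}
	return out
}

func main() {
	for _, fn := range captureStack() {
		fmt.Println(fn)
	}
}
```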
In CNS metrics.go, sig is unbuffered:
```go
recorder.sig = make(chan struct{})
```
The send uses a non-blocking select, which drops the signal if run() is not parked on receive: an unbuffered send succeeds only if a receiver is already waiting, so if run() is not blocked on <-a.sig, the default branch runs and the signal is silently dropped. Making the channel buffered with size 1 lets one signal be queued even while run() is busy, so it is not lost.
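The difference between the two channel shapes can be demonstrated directly (a minimal sketch with illustrative names, not CNS's code): against an unbuffered channel with no parked receiver, the non-blocking send always takes the default branch, while a capacity-1 channel holds the first signal for a later receiver.

```go
package main

import "fmt"

// trySend mirrors the non-blocking send in publishIPStateMetrics.
func trySend(sig chan struct{}) bool {
	select {
	case sig <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan struct{})
	// No goroutine is parked on <-unbuffered, so the send is dropped.
	fmt.Println(trySend(unbuffered)) // false

	buffered := make(chan struct{}, 1)
	// Capacity 1 queues the signal even though no receiver is waiting yet;
	// run() can pick it up whenever it next blocks on receive.
	fmt.Println(trySend(buffered)) // true
}
```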